Abstract

We propose a novel approach that jointly removes a reflection
or translucent layer from a scene and estimates scene
depth. The input data are captured via light field imaging.
The problem is couched as minimizing the rank of the
transmitted scene layer via Robust Principal Component
Analysis (RPCA). We also impose regularization based on
piecewise smoothness, gradient sparsity, and layer independence
to simultaneously recover the 3D geometry of the transmitted
layer. Experimental results on synthetic and real data show
that our technique is robust and reliable, and can handle a
broad range of challenging layer separation problems.


1. Introduction 

Reflections and transparency are prevalent in real scenes,
and are typically viewed as undesirable. Unfortunately, it
is non-trivial to remove them. The observed image I can
generally be modeled as a linear combination of a
transmitted layer T (which contains the scene of interest) and a
secondary layer S (which contains the reflection or
transparency). Typical examples include a picture behind a glass
cover and a scene blocked by a sheer curtain. Extracting
S from I is inherently ill-posed: we have two unknowns,
T and S, but only one equation. To make this
underconstrained problem more tractable, existing solutions
either impose additional priors (e.g., through user inputs or
spatial regularization) [16, 17] or use more constraints (e.g.,
by capturing more photographs) [29, 31, 18, 10].

In this paper, we present a new computational imaging
solution by exploiting emerging light field imaging
techniques. A light field (LF) captures an array of images from a
grid of viewpoints. It can be viewed as a single-shot
multi-view imaging system. The multi-view attribute enables
reliable depth estimation [11, 32, 14, 6] that eliminates the
need for the homography assumption in [29, 9, 31, 10]. Our
technique begins with estimating an initial disparity map
using SIFT flow [20]. We then warp all LF views to the
reference view (in our case, the central camera) to form an
image stack. We show that the image stack exhibits a
low-rank property, and we apply Robust Principal Component
Analysis (RPCA) for simultaneous layer separation and
disparity refinement.

Figure 1: Left: Our portable camera array with a
reconfigurable baseline. Right: We demonstrate how to exploit such
a light field camera for layer separation tasks.

A unique advantage of our LF-based solution is that we
can represent scene geometry as a single disparity map
under which the resulting warped image stack will be
low-rank. In contrast, the warped image stack in previous
multi-view approaches is only low-rank when scene geometry is
planar (via homographic warping on the cropped common
region), and these approaches can break down on complex
scenes (Figs. 4 and 5). We conduct experiments on both
synthetic and real data. In particular, we construct a 3 × 3
mini LF array that is portable and can be controlled by a
single tablet. Results on static and dynamic scenes show that
our technique is robust and reliable and can handle a broad
range of challenging layer separation problems.

2. Related Work 

The problem of image layer separation is ill-posed, and
typically relies on additional priors or constraints. Earlier
approaches rely on user inputs to provide priors on the two
layers. Levin et al. [16] develop a user-assisted system to
assign image gradients to one of the two layers. An automatic
method can then be used to search for a decomposition that
minimizes the total amount of edges and corners, using a
database of natural image patches [17].

To automate the layer separation process, more recent
techniques use multiple images, either from a fixed
viewpoint with varying camera settings (such as flash, focus, and
polarization), or from multiple viewpoints through the use
of a hand-held camera [13, 29, 9, 31, 28, 18, 10]. In the
case of the fixed viewpoint approach, [8, 27, 15] exploit
the effect of reflection under different rotation angles of a
polarizer. Agrawal et al. [4] show how a flash/no-flash
image pair can be used to remove both reflections and
highlights through gradient filtering and integration. Schechner
et al. [26] propose to vary the focus of the camera for
eliminating reflection artifacts. The use of different modes of
capture is complementary to our technique.

Methods for separating layers using multiple-viewpoint
images are based on the intuition that the transmitted layer
and reflection undergo different motions under changing
views. Szeliski et al. [29] propose to separate the two layers
by estimating global and local motions. Gai et al. [9] study
the statistics of natural images to extract both the motions of
the two layers and their mixing coefficients. In a
similar vein, Tsin et al. [31] assume locally planar motion
and require dense image capture to estimate both the depth
and appearance of each layer through EPI analysis. Sinha et
al. [28] speed up the process by adopting piecewise planar
scene models and extend semi-global matching [13]
for reliable layer separation. More recently, Guo et al. [10]
correlate all images through homography and then conduct
low-rank decomposition to effectively separate the
reflection layer from the transmitted layer. Although these
techniques are effective, the requirement of capturing multiple
and often many images of the scene from different viewpoints,
and hence at different time instances, significantly limits their
applicability. Further, there is an implicit assumption that the
scene is mostly planar and can be rectified via a homography.

We seek a single-shot solution through LF imaging. The
concept of LF imaging can be traced back to integral
photography by Lippmann [19], in which a lenslet array is used
to emulate acquisition of multiple viewpoints [3, 24, 21].
Hand-held plenoptic cameras are now commercially
available [22] and mobile camera arrays [1, 2] will be on the
market soon. In our experiments, we use a mini LF camera
array to support on-site acquisition. Techniques that
capitalize on the availability of such cameras include [12, 11]
(variational shape from LF data), [33] (line assisted stereo
matching), [30] (depth estimation of glossy surfaces), and
[6] (robust stereo matching).

Figure 2: Warping light field views to an image stack.
(a) Light field. (b) Warped view stack I. (c) Correspondence
in the warping. Every light field view (a) is unrolled as a
row vector and stacked into a matrix (b) using the disparity
map. We decompose it into the transmitted and secondary
matrices (c).

3. Problem Formulation 

In our work, we capture the LF of the scene (transmitted
layer) that has been superimposed with a secondary layer
(e.g., reflection). The inputs are LF images from different
viewpoints, and we take the central view as the reference
view. Our goal is to separate the layers for the reference
view by exploring redundant information that is available
from the other views. To account for scene appearance in
all the views, we estimate the disparity map of the
transmitted layer; this map is used to align all the LF views with
respect to the reference to facilitate layer separation. The
disparity map estimation and layer separation steps are done
iteratively.

We first explain our notation. Our LF consists of a 2D
grid of $K = N \times N$ viewpoints, with each image having
a resolution of $w \times h$. The $i$-th 2D sub-aperture image is
unrolled as a 1D image vector $V_i$, $i \in \{1, 2, \ldots, K\}$; the
term $\phi_i$ maps index $i$ to its position within the 2D image
grid. We assume the images are uniformly sampled
horizontally and vertically with an identical baseline, and $d$
represents the disparity map of the reference view with respect
to its one-hop neighbor view. We use $\hat{V}_i$ to represent the
warped result from $V_i$ to the reference using $d$. As with $V_i$,
$\hat{V}_i$ and $d$ are also unrolled into 1D row vectors in
$\mathbb{R}^{1 \times wh}$. Given $d$, we can compute the $\hat{V}_i$ and stack
them to form the matrix $I \in \mathbb{R}^{K \times wh}$. The warped LF
images will now contain the warped transmitted and secondary
layers: $\hat{V}_i = T_i + S_i$. We can similarly stack all $T_i$ and
$S_i$ into two matrices $T$ and $S \in \mathbb{R}^{K \times wh}$. Fig. 2
illustrates the warping process.
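
The stacking step described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation used in the paper: the function names `warp_view` and `stack_warped_views` are hypothetical, and bilinear interpolation with border clamping is an assumption. The toy data is a fronto-parallel scene with constant disparity, for which every warped row should coincide with the reference view away from the border.

```python
import numpy as np

def warp_view(view, disparity, phi):
    """Resample `view` at p - d(p) * phi using bilinear interpolation."""
    h, w = view.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    # Source coordinates in the sub-aperture view, clamped to the image border.
    sx = np.clip(xs - disparity * phi[0], 0, w - 1)
    sy = np.clip(ys - disparity * phi[1], 0, h - 1)
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    fx, fy = sx - x0, sy - y0
    top = (1 - fx) * view[y0, x0] + fx * view[y0, x1]
    bottom = (1 - fx) * view[y1, x0] + fx * view[y1, x1]
    return (1 - fy) * top + fy * bottom

def stack_warped_views(views, disparity, offsets):
    """Unroll every warped view into a row; the result has shape (K, w*h)."""
    rows = [warp_view(v, disparity, phi).ravel() for v, phi in zip(views, offsets)]
    return np.stack(rows)

# Toy fronto-parallel scene: every view is the reference shifted by d * phi.
rng = np.random.default_rng(0)
reference = rng.random((16, 16))
d_true = 2
offsets = [(ox, oy) for oy in (-1, 0, 1) for ox in (-1, 0, 1)]
views = [np.roll(reference, (-d_true * oy, -d_true * ox), axis=(0, 1))
         for ox, oy in offsets]
stack = stack_warped_views(views, np.full((16, 16), float(d_true)), offsets)
```

With the correct disparity, the interior of `stack` has identical rows, which is the low-rank structure the formulation relies on.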

Our goal is to recover $T$, $S$, and $d$ from the single equation
$I = T + S$. Since this problem is ill-posed, we need to
impose additional constraints as in [10]. First, the transmitted
layer should be the same after disparity warping to the
reference view, and therefore should be of low rank. In contrast,
the warped secondary layer should have pixel-wise low
coherence across views because it is warped using the
disparity of the transmitted layer rather than its own
disparity map, and therefore $S$ should be sparse. In addition, the
transmitted and secondary layers should be independent and
their gradients sparse. Putting all these together, we
formulate the layer separation problem as energy minimization:

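$$
\begin{aligned}
\underset{T,S,d,\omega}{\text{minimize}}\quad & \operatorname{rank}(T) + \lambda_1 \|DT \odot DS\|_0 + \lambda_2 \|DI - DT - DS\|_F^2 \\
& + \lambda_3 \|DT\|_0 + \lambda_4 \|DS\|_0 \\
& + \lambda_5 \|d - \omega\|_1 + \lambda_6 \|D\omega\|_1 \\
\text{subject to}\quad & I = T + S;\; T \succeq 0;\; S \succeq 0,
\end{aligned}
\tag{1}
$$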

where $\|\cdot\|_0$, $\|\cdot\|_1$, and $\|\cdot\|_F$ are the $\ell_0$, $\ell_1$, and Frobenius
norms respectively, $\omega$ is an intermediate variable for refining
the disparity map $d$, $\odot$ represents element-wise
multiplication, and $D$ is the finite difference operator applied to
an image in both the x and y directions.
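
To make the gradient terms concrete, the following minimal sketch evaluates them on two toy layers. It assumes forward differences for $D$ (the paper does not specify the discretization), and the helper names `finite_diff` and `l0` are hypothetical.

```python
import numpy as np

def finite_diff(img):
    """Forward differences in x and y, stacked along a leading axis."""
    dx = np.zeros_like(img)
    dy = np.zeros_like(img)
    dx[:, :-1] = img[:, 1:] - img[:, :-1]
    dy[:-1, :] = img[1:, :] - img[:-1, :]
    return np.stack([dx, dy])

def l0(x, eps=1e-12):
    """Number of entries whose magnitude exceeds eps."""
    return int(np.count_nonzero(np.abs(x) > eps))

# Toy layers: T has a vertical edge, S has a horizontal edge.
T = np.zeros((8, 8)); T[:, 4:] = 1.0
S = np.zeros((8, 8)); S[5:, :] = 0.5
I = T + S

independence = l0(finite_diff(T) * finite_diff(S))   # ||DT (.) DS||_0
consistency = float(np.sum((finite_diff(I) - finite_diff(T) - finite_diff(S)) ** 2))
sparsity_T = l0(finite_diff(T))                      # ||DT||_0
```

Because the edges of the two toy layers never coincide, the independence term is zero, and the consistency term vanishes because differencing is linear and `I = T + S`.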

In this formulation, the first term forces the rank of
matrix $T$ to be low. The second and third terms force the
gradients of the two layers to be mutually independent. The
fourth and fifth terms impose the sparse gradient prior on
natural images. The last two terms employ $\ell_1$-TV to refine
the disparity map $d$. We choose $\ell_1$-TV instead of $\ell_2$-TV
as the regularization term for two reasons. First, a
disparity map is largely piecewise constant. Second, the norm
measure $\|d - \omega\|_1$ is commonly used for evaluating the
percentage of bad pixels on disparity maps [25]. Therefore,
$\|d - \omega\|_1$ can be interpreted as the convexification of the bad
pixel percentage in $d$. We further impose hard constraints
that $T$ and $S$ be non-negative ($T \succeq 0$, $S \succeq 0$). The
optimization problem, however, is NP-hard. We follow [5] to
solve an alternative convex relaxation problem:

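$$
\begin{aligned}
\underset{T,S,d,\omega}{\text{minimize}}\quad & \|T\|_* + \lambda_1 \|DT \odot DS\|_1 + \lambda_2 \|DI - DT - DS\|_F^2 \\
& + \lambda_3 \|DT\|_1 + \lambda_4 \|DS\|_1 \\
& + \lambda_5 \|d - \omega\|_1 + \lambda_6 \|D\omega\|_1 \\
\text{subject to}\quad & I = T + S;\; T \succeq 0;\; S \succeq 0,
\end{aligned}
\tag{2}
$$

where the nuclear norm $\|\cdot\|_*$ replaces the rank function and the
$\ell_1$ norm replaces the $\ell_0$ norm in Eq. 1.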
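
As a small numerical illustration of why the nuclear norm is a useful surrogate for rank here, the sketch below compares a perfectly aligned stack (nine copies of one row) with a misaligned one (the same row circularly shifted per view). The data and the helper name `nuclear_norm` are illustrative assumptions, not part of the paper. Both stacks have the same Frobenius norm, but only the aligned one attains the minimum possible nuclear norm for that energy.

```python
import numpy as np

def nuclear_norm(m):
    """Sum of singular values, the convex surrogate for rank."""
    return float(np.linalg.svd(m, compute_uv=False).sum())

rng = np.random.default_rng(1)
row = rng.random(200)

# Perfect alignment: nine identical rows, so the stack has rank 1.
aligned = np.tile(row, (9, 1))
# Misalignment: each row is the same signal shifted by a different amount.
misaligned = np.stack([np.roll(row, k) for k in range(9)])

nn_aligned = nuclear_norm(aligned)
nn_misaligned = nuclear_norm(misaligned)
```

For a rank-1 matrix the nuclear norm equals the Frobenius norm, so any residual misalignment in the warped stack shows up as a strictly larger nuclear norm.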


The new formulation now allows convex optimization.
However, the 3D-warping function $I(d)$ is still highly
nonlinear. In order to linearize the warping function, we further
formulate $\hat{V}_i$ as:

$$
\hat{V}_i(p) = V_i(p - d(p)\,\phi_i), \tag{3}
$$

where $p$ is the image pixel coordinate. In order to convert
the objective function into a convex model, we follow [11]
to linearize the warped images using a first-order Taylor
approximation on the disparity $d^{(t)}$ at iteration $t$. For each image,
we have:

$$
\hat{V}_i^{(t+1)}(p) \approx V_i(p - d^{(t)}(p)\,\phi_i) + \left(d^{(t+1)}(p) - d^{(t)}(p)\right) \cdot J_i, \tag{4}
$$

where $J_i \in \mathbb{R}^{wh}$ is

$$
J_i = \left.\frac{\partial}{\partial d} V_i(p - d(p)\,\phi_i)\right|_{d = d^{(t)}}. \tag{5}
$$

Letting $\mathbf{J}_i = \operatorname{diag}(J_i)$, we rewrite the constraint in Eq. 2
as:

$$
\hat{I} + \sum_{i=1}^{K} e_i (\Delta d\, \mathbf{J}_i) = T + S;\; T \succeq 0;\; S \succeq 0, \tag{6}
$$

where $\hat{I} = I(d^{(t)})$, $\Delta d = d^{(t+1)} - d^{(t)}$, and $\{e_i\}$ is the
standard basis for $\mathbb{R}^K$. The constraint can be regarded as
linearizing the 3D-warping operation with respect to the
disparity map $d$.
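
The linearization can be illustrated on a single scanline. This is a hedged sketch under several assumptions: a 1D signal stands in for a sub-aperture view, `np.interp` (linear interpolation) plays the role of the warp, and the Jacobian is approximated by a central difference in $d$ rather than by any closed form from the paper. The names `warp_1d` and `warp_jacobian` are hypothetical.

```python
import numpy as np

def warp_1d(signal, d, phi):
    """Sample the scanline at x - d(x) * phi with linear interpolation."""
    x = np.arange(signal.size, dtype=float)
    return np.interp(x - d * phi, x, signal)

def warp_jacobian(signal, d, phi, eps=1e-3):
    """Per-pixel derivative of the warped scanline w.r.t. its own disparity."""
    return (warp_1d(signal, d + eps, phi) - warp_1d(signal, d - eps, phi)) / (2 * eps)

x = np.arange(64, dtype=float)
signal = np.sin(0.3 * x)
d_t = 1.25 + 0.1 * np.cos(x)          # current per-pixel disparity estimate
delta_d = np.full(64, 0.05)           # disparity update

J = warp_jacobian(signal, d_t, 1.0)
zeroth_order = warp_1d(signal, d_t, 1.0)
linearized = zeroth_order + delta_d * J
exact = warp_1d(signal, d_t + delta_d, 1.0)
```

Each warped pixel depends only on its own disparity value, which is why the Jacobian can be stored as a vector and applied as a diagonal matrix. In this toy setup the update stays within one interpolation segment, so `linearized` matches `exact` closely, while `zeroth_order` does not.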

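Finally, we combine all priors to simultaneously solve
for the transmitted and secondary layers as well as the
disparity map by solving the following convex optimization
problem:

$$
\begin{aligned}
\underset{T,S,d,\omega}{\text{minimize}}\quad & \|T\|_* + \lambda_1 \|DT \odot DS\|_1 \\
& + \lambda_2 \|DI - DT - DS\|_F^2 + \lambda_3 \|DT\|_1 + \lambda_4 \|DS\|_1 \\
& + \lambda_5 \|d - \omega\|_1 + \lambda_6 \|D\omega\|_1 \\
\text{s.t.}\quad & \hat{I} + \sum_{i=1}^{K} e_i (\Delta d\, \mathbf{J}_i) = T + S;\; T \succeq 0;\; S \succeq 0.
\end{aligned}
\tag{7}
$$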

4. Optimization 

In this section, we describe how to optimize the
objective function defined in Eq. 7. The algorithm is outlined in
Algorithm 1 and illustrated in Fig. 3.

4.1. Initialization 

Our approach starts by warping the sub-aperture images
to the center view. Previous studies assume global
parametric motion (e.g., homographies [29, 9, 10]). Despite its
computational efficiency and robustness, this approach is unable
to handle more complex parallax. In reality, the transmitted
layer is unlikely to be planar and a dense 3D reconstruction
would be needed for warping the images. Conceptually, we
can apply LF stereo matching such as [32, 14, 6] to first
estimate the 3D geometry. However, with the secondary layer
corrupting the transmitted layer, direct depth estimation
incurs significant errors. In our implementation, we use SIFT
flow [20] for correspondence, since it has been shown to be
effective for registering reflective scenes [18, 23].

Figure 3: Illustration of our processing pipeline (see Section 4 for details).

Similar to optical flow, SIFT flow only allows
descriptors to be matched along the flow vector $w_i(u) =
(w_{ix}(u), w_{iy}(u))$, which is composed of the horizontal and
the vertical components. This fits our model well since
the relative motion between the sub-aperture images and
the reference image should approximately follow the flow.
The initial disparity is then obtained by averaging local
flows, i.e.,

$$
d(u) = \frac{1}{K} \sum_{i=1}^{K} \frac{w_i(u)^\top w_i(u)}{w_i(u)^\top \phi_i}. \tag{8}
$$
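
A minimal sketch of this initialization is given below. It assumes that each flow is approximately $d(u)\,\phi_i$ and uses the per-view ratio $w_i^\top w_i / w_i^\top \phi_i$ as the disparity estimate; the sign convention, the skipping of degenerate views (such as the reference view, whose offset is zero), and the function name `initial_disparity` are all assumptions made for illustration.

```python
import numpy as np

def initial_disparity(flows, offsets, eps=1e-8):
    """Average the per-view estimates (w . w) / (w . phi), skipping degenerate views."""
    total = np.zeros(flows[0].shape[:2])
    count = np.zeros(flows[0].shape[:2])
    for w, phi in zip(flows, offsets):
        num = np.sum(w * w, axis=-1)                 # w(u)^T w(u)
        den = w @ np.asarray(phi, dtype=float)       # w(u)^T phi_i
        valid = np.abs(den) > eps                    # e.g. excludes the reference view
        total[valid] += num[valid] / den[valid]
        count[valid] += 1
    return total / np.maximum(count, 1)

# Synthetic check: flows generated from a known disparity map.
rng = np.random.default_rng(2)
d_true = 1.0 + rng.random((4, 5))
offsets = [(ox, oy) for oy in (-1, 0, 1) for ox in (-1, 0, 1)]
flows = [d_true[..., None] * np.asarray(phi, dtype=float) for phi in offsets]
d_init = initial_disparity(flows, offsets)
```

When a flow is exactly parallel to its grid offset, the ratio returns the disparity itself, so averaging over views mainly serves to suppress noise in real flow fields.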


4.2. Iterative Optimization 

Given the initial disparity estimation, we use the recently
proposed Augmented Lagrange Multiplier (ALM) method with the
Alternating Direction Minimizing (ADM) strategy [10] to
optimize the objective function in Eq. 7. Specifically, we can
separate the objective into individual sub-problems by
introducing five auxiliary variables: $A = T$, $B = DT$, $C = DS$,
$E = d - \omega$, $F = D\omega$. We also use an intermediate
variable $G$ to represent $\hat{I} + \sum_{i=1}^{K} e_i (\Delta d\, \mathbf{J}_i)$. Under our
formulation, the augmented Lagrangian function can now be
represented as:

$$
\begin{aligned}
\mathcal{L}(T, S, d, \omega, A, B, C, E, F) ={} & \|A\|_* + \lambda_1 \|B \odot C\|_1 + \lambda_2 \|DI - B - C\|_F^2 \\
& + \lambda_3 \|B\|_1 + \lambda_4 \|C\|_1 + \lambda_5 \|E\|_1 + \lambda_6 \|F\|_1 \\
& + \Phi(L_1, G - T - S) \\
& + \Phi(L_2, A - T) + \Phi(L_3, B - DT) + \Phi(L_4, C - DS) \\
& + \Phi(L_5, E - d + \omega) + \Phi(L_6, F - D\omega),
\end{aligned}
\tag{9}
$$

where $\Phi(X, Y) = \langle X, Y \rangle + \frac{\mu}{2} \|Y\|_F^2$, $\mu$ is a positive scalar,
and $L_1, \ldots, L_6$ are Lagrange multipliers. The goal of ALM
is to find a saddle point of $\mathcal{L}(T, S, d, \omega, A, B, C, E, F)$, which
approximates the solution of the original problem. We
adopt the alternating direction method to iteratively solve
the sub-problems. The solutions and steps for each
sub-problem are listed in the Appendix (attached as
supplementary material).

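Once we obtain the solutions at each iteration, we further
update the multipliers as:

$$
\begin{aligned}
L_1^{t+1} &= L_1^{t} + \mu^{t} (G^{t+1} - T^{t+1} - S^{t+1}) \\
L_2^{t+1} &= L_2^{t} + \mu^{t} (A^{t+1} - T^{t+1}) \\
L_3^{t+1} &= L_3^{t} + \mu^{t} (B^{t+1} - DT^{t+1}) \\
L_4^{t+1} &= L_4^{t} + \mu^{t} (C^{t+1} - DS^{t+1}) \\
L_5^{t+1} &= L_5^{t} + \mu^{t} (E^{t+1} - d^{t+1} + \omega^{t+1}) \\
L_6^{t+1} &= L_6^{t} + \mu^{t} (F^{t+1} - D\omega^{t+1}).
\end{aligned}
\tag{10}
$$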
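
The structure of these updates is easiest to see on the basic RPCA problem, minimizing $\|T\|_* + \lambda \|S\|_1$ subject to $I = T + S$, which keeps only the low-rank and sparse terms. The sketch below is a simplified stand-in, not the full solver for Eq. 7: it omits the gradient, independence, and disparity terms, and the function name `rpca_alm`, the default $\lambda = 1/\sqrt{\max(K, wh)}$, the initial $\mu$, and the growth factor are conventional choices assumed here rather than values taken from the paper.

```python
import numpy as np

def shrink(x, tau):
    """Soft thresholding: proximal operator of tau * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def svt(m, tau):
    """Singular value thresholding: proximal operator of tau * ||.||_*."""
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    return (u * shrink(s, tau)) @ vt

def rpca_alm(I, lam=None, mu=None, eta=1.2, tol=1e-7, max_iter=1000):
    """Inexact ALM for: minimize ||T||_* + lam * ||S||_1  s.t.  I = T + S."""
    lam = 1.0 / np.sqrt(max(I.shape)) if lam is None else lam
    mu = 1.25 / np.linalg.norm(I, 2) if mu is None else mu
    T = np.zeros_like(I)
    S = np.zeros_like(I)
    L = np.zeros_like(I)                       # Lagrange multiplier
    for _ in range(max_iter):
        T = svt(I - S + L / mu, 1.0 / mu)      # low-rank sub-problem
        S = shrink(I - T + L / mu, lam / mu)   # sparse sub-problem
        residual = I - T - S
        L = L + mu * residual                  # multiplier update
        mu *= eta                              # penalty growth
        if np.linalg.norm(residual) <= tol * np.linalg.norm(I):
            break
    return T, S

# Toy aligned stack: nine identical rows plus sparse additive corruption.
rng = np.random.default_rng(0)
row = 0.5 + 0.5 * rng.random(400)
T_true = np.tile(row, (9, 1))
S_true = np.zeros_like(T_true)
cols = np.arange(0, 400, 7)
S_true[cols % 9, cols] = 0.5
T_est, S_est = rpca_alm(T_true + S_true)
```

Each pass solves one sub-problem per variable in closed form, then updates the multiplier with the constraint residual and increases the penalty, which mirrors the update pattern in Eq. 10.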

Algorithm 1 shows the complete process. The
termination condition is that the change of the objective function
between two consecutive iterations is very small (0.1 in our
experiments). The inner loop terminates when the
Frobenius-norm residual falls below a preset relative
tolerance or the maximum number of iterations is reached.

Figure 5: Warping both transmitted and secondary layers
using (a) homography vs. (b) disparity map. Disparity map
produces more consistency than homography on the
transmitted layer. Both transformations produce high
incoherence on the secondary layer.

Figure 7: Refocusing results. We demonstrate depth-guided
refocusing using the depth map and transmitted layer image
recovered by our algorithm. Close-ups show that color and
depth boundaries are well-aligned.

Algorithm 1: Layer Separation and Depth Estimation

Input: Raw light field data $R$
Initialize: $\lambda_1, \ldots, \lambda_6 > 0$; $T^0 = S^0 = \omega^0 = A^0 = B^0 = C^0 = E^0 = F^0 = 0$;
    $d^0 = 0$; $t = 0$; $\mu^0 > 0$; $\eta > 1$
while $i < K$ do
    Compute SIFT flow of view $V_i$ w.r.t. center view $V_{ref}$;
Initialize disparity map $d$ via Eq. 8;
Update $I(d)$ by warping $V_i$ to center view $V_{ref}$;
Iteration:
while not converged do
    while not converged do
        Update $T^{t+1}, S^{t+1}, \omega^{t+1}, A^{t+1}, B^{t+1}, C^{t+1}, E^{t+1}, F^{t+1}$ and $\Delta d$;
        Update $L_1^{t+1}, L_2^{t+1}, L_3^{t+1}, L_4^{t+1}, L_5^{t+1}, L_6^{t+1}$ via Eq. 10;
        $\mu^{t+1} = \eta \mu^{t}$;
        $t = t + 1$;
    Update $d = d^{t} + \Delta d$;
    Update $I$;
Output: Separated transmitted layer $T$, secondary
layer $S$ and disparity map $d$.


5. Experiments 

We have conducted experiments on both synthetic and
real data. All experiments are conducted on an Intel i7 PC
(3.2GHz CPU, 16GB RAM) with the same set of
parameters. We compared our results to two state-of-the-art
techniques, [18] and [10], using the authors’ source code with
default parameters.

We first add synthetic reflections by superposing an
additional layer on the Stanford LF images [ ]. The resolution
of the synthetic images is 1024 × 1024 and the motion of
the additive layer is set to 20 pixels between adjacent views,
opposite to the camera motion. Fig. 4 shows that our technique
outperforms these alternative solutions in both accuracy and
visual quality. This illustrates the importance of
recovering the 3D shape of the transmitted layer. The multi-image
technique of [10] uses homography (i.e., planes) as priors
to register multiple images onto a common viewpoint. In
our examples (e.g., the Stanford Bunny), the transmitted
layer is non-planar and exhibits complex depth variations.
As a result, [10] produces relatively large errors and
ghosting artifacts due to image misalignment. In contrast, our
technique has significantly fewer artifacts while recovering a
relatively high quality disparity map. To illustrate the
limitation of homography in transforming 3D scenes, we
compare the transformed layers shown in Fig. 5. Disparity-based
warping produces more consistency than homography on
the transmitted layer.

The technique of [18] is most similar to ours. It also
models the transformation of the transmitted layers across
different views as a flow field and uses SIFT flow for
image warping. Therefore it is expected to better handle a
non-planar transmitted layer, as shown in column 3 in Fig. 4.
However, it computes the flow field only once (at the
beginning). Consequently, the separation quality is heavily
reliant on the quality of flow initialization. For example, the
bunny on the transmitted layer appears blurred in Fig. 4
since the initial disparity estimation is erroneous.

Figure 4: Results on synthetic data. The recovered secondary layer has been enhanced. Column 1 shows the sample input
images; Columns 2-4 show results using our technique, [18] and [10]. For each technique, we show the recovered transmitted
layer and secondary layer with close-up views.

By comparison, our technique incorporates disparity
estimation and layer separation into an iterative joint
optimization framework. The benefits of our technique can be
seen in Fig. 4, with better detail recovery and better overall
quality of layer separation.

For real experiments, we need to capture LF images with
a reasonable baseline between adjacent viewpoints. We did
not use the Lytro [22] because it has an ultra-small
baseline that limits its working range to only about 6 inches,
whereas existing camera arrays are too bulky for practical
use. We built our own portable LF array consisting of 9
Microsoft LifeCam HD-6000 USB cameras on a 3D-printed
grid (Fig. 1). The resolution of each camera is 2560 × 1440,
and the baseline can be set to either 1, 2, or 3 inches. To
capture static scenes, we connect all cameras to a Key nice
HI088 10-port hub powered by an Anker Astro Pro2
external battery pack. A single HP Stream 7 tablet is used to
trigger individual cameras and store data. It takes around
1 second to take all 9 shots at full resolution. To capture
dynamic scenes, we connect the cameras to a workstation
equipped with 3 PCI-E USB 3.0 adaptors, each having 4
dedicated 5Gbps channels. This configuration allows us to
record HD (720p) LF videos at 30 fps. We pre-calibrate the
cameras using the technique described in [34].

For validation, we captured some scenes with a
reflective layer and others with a translucent layer. We first
capture an LF of a painting within a glass frame using the 3-inch
baseline. This is a typical problem that [10] aims to solve.
Our method produces comparable results. However, it is
worth noting that [10] requires users to manually find four
corresponding corners in a view for computing the
homography. We instead automatically compute the disparity map
without any user input. In the second example, we capture
a figurine behind a translucent layer of cloth using the
1-inch baseline. Our method is able to reliably recover the
3D geometry of the figurine as well as remove the effect of
the cloth layer. To use [10], we select four feature points on the
images and approximate a homography for warping the
images. Their results exhibit clear visual artifacts due to their
inability to account for arbitrary depth variation.

Next, we capture three objects made of different
materials behind a reflective glass. This emulates the museum
setting of photographing 3D artifacts. These objects,
especially the toy truck, have clear depth variations and the
parallax across the LF views violates the homography model.
Consequently, both the recovered transmitted layer and the
secondary layer from [10] exhibit ghosting artifacts due to
misalignment of views. The technique of [18] partially
reduces these artifacts as the initial SIFT flow better registers the
images. However, the SIFT flow still has a large deviation
from the actual disparity map and their results exhibit
artifacts on heavily saturated regions due to misalignment.

Figure 6: Results on real scenes. From top to bottom: capturing a painting within a glass frame (row 1); a figurine behind
a translucent layer of cloth (row 2); a copper statue, a plastic toy and a ceramic vase behind glass (last 3 rows). For each
technique, we show recovered transmitted and secondary layers. Note that [18] (column 3) and [10] (column 4) crop the
original image and the reflection layer’s contrast has been boosted for better visualization.

Our technique is able to generate better results. More
importantly, with the help of the disparity map, we are able
to align the views and eliminate most of the reflection
layers while preserving fine geometric details and texture, as
seen in Fig. 6. Our layer separation solution also produces
a high quality 3D depth map, with which we can perform
IBR effects such as depth-guided refocusing (Fig. 7) on the
transmitted layer. Fig. 8 shows our results on a dynamic
scene with a toy truck moving behind glass. The bottom
row shows results of removing the fast moving reflection.
To the best of our knowledge, our solution is the first to
perform reliable layer separation on dynamic scenes.

We examined our LF camera in a variety of
environments, and found that the 1-inch baseline provides enough
view changes for almost all practical scenes that are 4-6 feet
away. Also, a 3 × 3 LF is sufficient for nearly all cases.
Using more views would further strengthen the low-rank
constraint in the RPCA optimization but is also more
computationally expensive. Our method takes about 7 minutes on
average to process one LF video frame (containing 9 views at a
resolution of 640 × 480). The code of [10] takes about 3 minutes to
process an image sequence of the same size. The author of [18]
reports a running time of about 5 minutes for a 500 × 400
image sequence containing up to 5 images.

As with previous techniques, we assume that the
transmitted layer is dominant, with the contribution of the
secondary layer being relatively small. This ensures that the
SIFT flow algorithm will mostly choose feature points from
the transmitted layer to produce mostly correct warping. If
the assumption is violated, the detected feature points will
come from a mixed pool of two layers. Since our iterative
refinement process is local, it may not be able to overcome
the large errors.

Figure 8: Results on a dynamic scene with moving objects
in both layers. Left: input image frames, changes in
reflection are highlighted. Right: transmitted layer recovered by
the proposed method.

We experimented on a synthetic scene dataset where we
control the blending of the two layers with a blending
parameter α. We apply our layer separation technique for
different values of α. We compute the percentage of
incorrectly recovered pixels in both layers, where we use 0.1 (for
intensity range [0, 1]) as the threshold to determine if a
recovered pixel is incorrect. Fig. 9 shows the layer separation
accuracy versus α. For small α (e.g., in the range [0, 25]%),
we are able to obtain good results. The performance
significantly degrades when α is above 35% and our algorithm
fails when α is above 50%.
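
The error measure described above is straightforward to compute. The sketch below is a minimal illustration; the function name `bad_pixel_percentage` is hypothetical, and the toy arrays are made up purely to exercise it.

```python
import numpy as np

def bad_pixel_percentage(recovered, truth, threshold=0.1):
    """Percentage of pixels whose absolute error exceeds `threshold`.

    Intensities are assumed to lie in [0, 1], matching the 0.1 threshold.
    """
    errors = np.abs(np.asarray(recovered, dtype=float) - np.asarray(truth, dtype=float))
    return 100.0 * float(np.mean(errors > threshold))

# Toy example: two of ten pixels are off by more than the threshold.
truth = np.zeros(10)
recovered = truth.copy()
recovered[[2, 7]] = 0.2
recovered[4] = 0.05          # small error, still counted as correct
score = bad_pixel_percentage(recovered, truth)
```

The same function can be applied separately to the transmitted and secondary layers to obtain the two curves plotted against α.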

6. Conclusion 

We have presented a novel technique that automatically
separates the transmitted and secondary layers. At the core
of our technique is the use of light field imaging to acquire
multi-view images. With approximate scene depth of the
transmitted layer, we can warp all light field views to the
reference view to form an image stack. The corresponding
transmitted stack is expected to be of low rank, while the
secondary layer is of low coherence and hence sparse. We
start with SIFT flow to generate the initial depth map and
then apply an iterative optimization scheme based on
Robust PCA (RPCA) for layer separation and depth map
refinement. It is worth noting that our technique handles
dynamic scenes (e.g., removing reflections from video), which
would be almost impossible for traditional methods using
an unstructured collection of viewpoints.

An implicit assumption of our technique is that the
transmitted layer is predominant so that SIFT flow can produce a
reliable initial estimation of the disparity map of the
transmitted layer. We plan to investigate the structure of the
secondary layer to relax this assumption. We would also like
to try our technique on the Pelican [1] or Light [2] mobile
LF cameras, which are expected to be on the market soon, and
compare their results with those from our light field setup.
Another interesting direction is to estimate the 3D shape of the
secondary layer as well, by reformulating our problem
using two unknown disparity maps.

Figure 9: Reconstruction accuracy vs. transparency.
We apply different blending parameters for combining the
transmitted and secondary layers on the Stanford Bunny
scene. We use a light field of 3 × 3 views with a resolution
of 1024 × 1024 and compute the percentage of incorrectly
recovered pixels for both layers.